Introduction to Anomaly Detection in Time Series with Keras


What is Anomaly Detection?

Anomaly detection, often referred to as outlier detection, is the process of identifying data points that are fundamentally different from the rest of a dataset. Anomalous data, whether in financial transactions, stock prices, or medical measurements, can be crucial for spotting an unusual trend or a potential problem early.
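The idea can be illustrated with a minimal, self-contained sketch (the toy data and the 2-standard-deviation cutoff are illustrative assumptions, not part of this notebook): points that lie far from the bulk of the data are flagged as anomalies.

```python
import numpy as np

# Toy series with one obvious outlier (illustrative data).
data = np.array([10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 25.0])

# Flag points more than 2 standard deviations from the mean.
z_scores = (data - data.mean()) / data.std()
anomalies = data[np.abs(z_scores) > 2]
print(anomalies)  # [25.]
```

The autoencoder approach in this notebook follows the same logic, except the "distance" is a learned reconstruction error rather than a simple z-score.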

Business Benefits:

Advances in machine learning and deep learning continue to expand both the variety of techniques for identifying anomalies and their usefulness for business data. The growing volume and sophistication of that data translate into significant opportunities for businesses able to harness this knowledge.

In [1]:
import numpy as np
import tensorflow as tf
import pandas as pd
pd.options.mode.chained_assignment = None  # silence SettingWithCopyWarning when scaling the train/test slices
import seaborn as sns
from matplotlib.pylab import rcParams
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go

%matplotlib inline

sns.set(style='whitegrid', palette='muted')
rcParams['figure.figsize'] = 14, 8
np.random.seed(1)
tf.random.set_seed(1)

print('Tensorflow version:', tf.__version__)
Tensorflow version: 2.0.0

Load and Inspect the S&P 500 Index Data

In [2]:
df = pd.read_csv('S&P_500_Index_Data.csv', parse_dates=['date'])
df.head()
Out[2]:
date close
0 1986-01-02 209.59
1 1986-01-03 210.88
2 1986-01-06 210.65
3 1986-01-07 213.80
4 1986-01-08 207.97
In [3]:
df.shape
Out[3]:
(8192, 2)
In [4]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=df.date, y=df.close,
                    mode='lines',
                    name='close'))
fig.update_layout(showlegend=True)
fig.show()

Data Preprocessing

In [5]:
train_size = int(len(df) * 0.8)
test_size = len(df) - train_size
train, test = df.iloc[0:train_size], df.iloc[train_size:len(df)]
print(train.shape, test.shape)
(6553, 2) (1639, 2)
In [6]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler = scaler.fit(train[['close']])

train['close'] = scaler.transform(train[['close']])
test['close'] = scaler.transform(test[['close']])

Create Sequences with a Sliding Window

In [7]:
def create_dataset(X, y, time_steps=1):
    # Slide a window of length `time_steps` over the series: each window
    # becomes one sample, and the value right after it becomes the target.
    Xs, ys = [], []
    for i in range(len(X) - time_steps):
        v = X.iloc[i:(i + time_steps)].values
        Xs.append(v)
        ys.append(y.iloc[i + time_steps])
    return np.array(Xs), np.array(ys)
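To see exactly what `create_dataset` produces, here is a toy run on a five-point series with a two-step window (the toy values are illustrative):

```python
import numpy as np
import pandas as pd

def create_dataset(X, y, time_steps=1):
    # Same windowing logic as above: each window of `time_steps` rows
    # becomes one sample; the value right after the window is its target.
    Xs, ys = [], []
    for i in range(len(X) - time_steps):
        Xs.append(X.iloc[i:(i + time_steps)].values)
        ys.append(y.iloc[i + time_steps])
    return np.array(Xs), np.array(ys)

toy = pd.DataFrame({'close': [1.0, 2.0, 3.0, 4.0, 5.0]})
Xs, ys = create_dataset(toy[['close']], toy.close, time_steps=2)
print(Xs.shape)  # (3, 2, 1): 3 windows, 2 time steps, 1 feature
print(ys)        # [3. 4. 5.]
```

This is why `X_train.shape` below comes out as `(6523, 30, 1)`: 6553 training rows minus the 30-step window.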
In [8]:
time_steps = 30

X_train, y_train = create_dataset(train[['close']], train.close, time_steps)
X_test, y_test = create_dataset(test[['close']], test.close, time_steps)

print(X_train.shape)
(6523, 30, 1)

Build an LSTM Autoencoder

In [9]:
timesteps = X_train.shape[1]
num_features = X_train.shape[2]
In [10]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Dropout, RepeatVector, TimeDistributed

model = Sequential([
    LSTM(128, input_shape=(timesteps, num_features)),
    Dropout(0.2),
    RepeatVector(timesteps),
    LSTM(128, return_sequences=True),
    Dropout(0.2),
    TimeDistributed(Dense(num_features))                 
])

model.compile(loss='mae', optimizer='adam')
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
lstm (LSTM)                  (None, 128)               66560     
_________________________________________________________________
dropout (Dropout)            (None, 128)               0         
_________________________________________________________________
repeat_vector (RepeatVector) (None, 30, 128)           0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 30, 128)           131584    
_________________________________________________________________
dropout_1 (Dropout)          (None, 30, 128)           0         
_________________________________________________________________
time_distributed (TimeDistri (None, 30, 1)             129       
=================================================================
Total params: 198,273
Trainable params: 198,273
Non-trainable params: 0
_________________________________________________________________
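As a sanity check on the parameter counts in the summary, each LSTM layer has four gates, and each gate carries an input-weight matrix, a recurrent-weight matrix, and a bias vector:

```python
# 4 gates, each with (input_dim + units) x units weights plus units biases.
def lstm_params(input_dim, units):
    return 4 * ((input_dim + units) * units + units)

print(lstm_params(1, 128))    # 66560  (encoder LSTM, 1 input feature)
print(lstm_params(128, 128))  # 131584 (decoder LSTM, fed 128-dim vectors)
print(128 * 1 + 1)            # 129    (TimeDistributed Dense: weights + bias)
```

The three counts sum to the 198,273 trainable parameters reported above.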

Train the Autoencoder

In [11]:
es = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=3, mode='min')
history = model.fit(
    X_train, y_train,
    epochs=100,
    batch_size=32,
    validation_split=0.1,
    callbacks = [es],
    shuffle=False
)
Train on 5870 samples, validate on 653 samples
Epoch 1/100
5870/5870 [==============================] - 37s 6ms/sample - loss: 0.1625 - val_loss: 0.1610
Epoch 2/100
5870/5870 [==============================] - 28s 5ms/sample - loss: 0.1114 - val_loss: 0.0986
Epoch 3/100
5870/5870 [==============================] - 28s 5ms/sample - loss: 0.0903 - val_loss: 0.0443
Epoch 4/100
5870/5870 [==============================] - 34s 6ms/sample - loss: 0.0802 - val_loss: 0.0442
Epoch 5/100
5870/5870 [==============================] - 33s 6ms/sample - loss: 0.0717 - val_loss: 0.0618
Epoch 6/100
5870/5870 [==============================] - 33s 6ms/sample - loss: 0.0774 - val_loss: 0.0326
Epoch 7/100
5870/5870 [==============================] - 33s 6ms/sample - loss: 0.0751 - val_loss: 0.0309
Epoch 8/100
5870/5870 [==============================] - 34s 6ms/sample - loss: 0.0745 - val_loss: 0.0593
Epoch 9/100
5870/5870 [==============================] - 33s 6ms/sample - loss: 0.0759 - val_loss: 0.0543
Epoch 10/100
5870/5870 [==============================] - 33s 6ms/sample - loss: 0.0766 - val_loss: 0.0307
Epoch 11/100
5870/5870 [==============================] - 35s 6ms/sample - loss: 0.0734 - val_loss: 0.0592
Epoch 12/100
5870/5870 [==============================] - 34s 6ms/sample - loss: 0.0748 - val_loss: 0.0632
Epoch 13/100
5870/5870 [==============================] - 35s 6ms/sample - loss: 0.0751 - val_loss: 0.0270
Epoch 14/100
5870/5870 [==============================] - 34s 6ms/sample - loss: 0.0744 - val_loss: 0.0393
Epoch 15/100
5870/5870 [==============================] - 34s 6ms/sample - loss: 0.0740 - val_loss: 0.0440
Epoch 16/100
5870/5870 [==============================] - 35s 6ms/sample - loss: 0.0762 - val_loss: 0.0278

Plot Metrics and Evaluate the Model

In [12]:
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.legend();
In [13]:
X_train_pred = model.predict(X_train)

train_mae_loss = pd.DataFrame(np.mean(np.abs(X_train_pred - X_train), axis=1), columns=['Error'])
In [15]:
sns.distplot(train_mae_loss, bins=50, kde=True);
In [16]:
X_test_pred = model.predict(X_test)

test_mae_loss = np.mean(np.abs(X_test_pred - X_test), axis=1)
In [17]:
sns.distplot(test_mae_loss, bins=50, kde=True);

Detect Anomalies in the S&P 500 Index Data

In [18]:
THRESHOLD = 0.65  # chosen by inspecting the reconstruction-error distributions above

test_score_df = pd.DataFrame(test[time_steps:])
test_score_df['loss'] = test_mae_loss
test_score_df['threshold'] = THRESHOLD
test_score_df['anomaly'] = test_score_df.loss > test_score_df.threshold
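The 0.65 cutoff is read off the loss histograms by eye. An alternative, sketched below with synthetic errors standing in for the notebook's `train_mae_loss` (the `k = 3` rule of thumb is an assumption, not something this notebook uses), is to derive the threshold from the training reconstruction errors themselves:

```python
import numpy as np

# Synthetic stand-in for the training reconstruction errors computed above.
rng = np.random.default_rng(1)
train_errors = np.abs(rng.normal(0.05, 0.02, size=1000))

# Rule of thumb: flag anything beyond mean + k standard deviations.
k = 3
threshold = train_errors.mean() + k * train_errors.std()
print(round(float(threshold), 4))
```

A data-driven threshold like this adapts automatically if the model is retrained, at the cost of assuming the training errors are roughly unimodal.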
In [19]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=test[time_steps:].date, y=test_score_df.loss,
                    mode='lines',
                    name='Test Loss'))
fig.add_trace(go.Scatter(x=test[time_steps:].date, y=test_score_df.threshold,
                    mode='lines',
                    name='Threshold'))
fig.update_layout(showlegend=True)
fig.show()
In [20]:
anomalies = test_score_df[test_score_df.anomaly == True]
anomalies.head()
Out[20]:
date close loss threshold anomaly
7474 2015-08-25 2.457439 0.653680 0.65 True
7475 2015-08-26 2.632149 0.709780 0.65 True
8090 2018-02-05 4.329949 0.658895 0.65 True
8091 2018-02-06 4.440671 0.847428 0.65 True
8092 2018-02-07 4.408365 0.823429 0.65 True
In [21]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=test[time_steps:].date, y=scaler.inverse_transform(test[time_steps:].close),
                    mode='lines',
                    name='Close Price'))
fig.add_trace(go.Scatter(x=anomalies.date, y=scaler.inverse_transform(anomalies.close),
                    mode='markers',
                    name='Anomaly'))
fig.update_layout(showlegend=True)
fig.show()